White Wine Data Exploration

by Sam Morrow

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   2   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1227   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2451   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2452   Mean   : 6.854   Mean   :0.2781   Mean   :0.3334  
##  3rd Qu.:3676   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3800  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.000   Median :0.04300   Median : 34.00     
##  Mean   : 6.148   Mean   :0.04568   Mean   : 35.22     
##  3rd Qu.: 9.600   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :17.950   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH         sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.72   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.09   1st Qu.:0.4100  
##  Median :133.0        Median :0.9937   Median :3.18   Median :0.4700  
##  Mean   :137.9        Mean   :0.9939   Mean   :3.19   Mean   :0.4902  
##  3rd Qu.:167.0        3rd Qu.:0.9959   3rd Qu.:3.28   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0024   Max.   :3.82   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.53   Mean   :5.885  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25453969      0.002809863
## fixed.acidity        -0.254539689    1.00000000     -0.024941981
## volatile.acidity      0.002809863   -0.02494198      1.000000000
## citric.acid          -0.146465102    0.28743510     -0.161408157
## residual.sugar        0.012545851    0.08951534      0.050190875
## chlorides            -0.048043133    0.02434731      0.071430730
## free.sulfur.dioxide  -0.008737141   -0.05059463     -0.093568972
## total.sulfur.dioxide -0.161283394    0.08974740      0.092707018
## density              -0.194756843    0.27568774      0.006473784
## pH                   -0.119515777   -0.42827622     -0.034870905
## sulphates             0.007607229   -0.01611827     -0.038408212
## alcohol               0.214087328   -0.12297629      0.063132761
## quality               0.032062686   -0.11376730     -0.198627115
##                       citric.acid residual.sugar   chlorides
## X                    -0.146465102     0.01254585 -0.04804313
## fixed.acidity         0.287435102     0.08951534  0.02434731
## volatile.acidity     -0.161408157     0.05019088  0.07143073
## citric.acid           1.000000000     0.08135806  0.11810290
## residual.sugar        0.081358056     1.00000000  0.08170230
## chlorides             0.118102903     0.08170230  1.00000000
## free.sulfur.dioxide   0.091529288     0.31502227  0.10176926
## total.sulfur.dioxide  0.116471355     0.40652142  0.19885721
## density               0.144726507     0.82040112  0.26263441
## pH                   -0.165932129    -0.19055836 -0.09076344
## sulphates             0.064584287    -0.02303271  0.01656835
## alcohol              -0.079091027    -0.45178076 -0.36152981
## quality              -0.006495807    -0.08415482 -0.20784465
##                      free.sulfur.dioxide total.sulfur.dioxide      density
## X                           -0.008737141          -0.16128339 -0.194756843
## fixed.acidity               -0.050594628           0.08974740  0.275687737
## volatile.acidity            -0.093568972           0.09270702  0.006473784
## citric.acid                  0.091529288           0.11647136  0.144726507
## residual.sugar               0.315022273           0.40652142  0.820401125
## chlorides                    0.101769259           0.19885721  0.262634409
## free.sulfur.dioxide          1.000000000           0.61377314  0.308932572
## total.sulfur.dioxide         0.613773142           1.00000000  0.543326254
## density                      0.308932572           0.54332625  1.000000000
## pH                           0.003866531           0.01068473 -0.086258076
## sulphates                    0.059736861           0.13396328  0.081269157
## alcohol                     -0.247918033          -0.44449032 -0.807752697
## quality                      0.011901107          -0.17134265 -0.312454677
##                                pH    sulphates     alcohol      quality
## X                    -0.119515777  0.007607229  0.21408733  0.032062686
## fixed.acidity        -0.428276224 -0.016118267 -0.12297629 -0.113767296
## volatile.acidity     -0.034870905 -0.038408212  0.06313276 -0.198627115
## citric.acid          -0.165932129  0.064584287 -0.07909103 -0.006495807
## residual.sugar       -0.190558361 -0.023032706 -0.45178076 -0.084154817
## chlorides            -0.090763444  0.016568351 -0.36152981 -0.207844646
## free.sulfur.dioxide   0.003866531  0.059736861 -0.24791803  0.011901107
## total.sulfur.dioxide  0.010684731  0.133963276 -0.44449032 -0.171342647
## density              -0.086258076  0.081269157 -0.80775270 -0.312454677
## pH                    1.000000000  0.158215782  0.11472929  0.097288873
## sulphates             0.158215782  1.000000000 -0.01838456  0.052596089
## alcohol               0.114729292 -0.018384558  1.00000000  0.433607807
## quality               0.097288873  0.052596089  0.43360781  1.000000000

Univariate Plots Section

Univariate Analysis

What is the structure of your dataset?

The data is very clean, and is ordered into rows, each a single observation with the following measured features:

  • Fixed Acidity
  • Volatile Acidity
  • Citric Acid
  • Residual Sugar
  • Chlorides
  • Free Sulfur Dioxide
  • Total Sulfur Dioxide
  • Density
  • pH
  • Sulphates
  • Alcohol

Each row then has a quality rating, which has been gauged by at least 3 experts.

What is/are the main feature(s) of interest in your dataset?

  • Quality
  • Alcohol
  • Density

I’m interested in trying to model wine quality by looking at correlations between quality and other features, so quality is my primary feature of interest. Alcohol has the strongest correlation with quality (0.436), so I will have to factor that into my investigations, in spite of my subjective experience that there are quality wines of many strengths. Since this is the strongest correlation with quality in the dataset, one other thing is clear: no single variable is a strong enough indicator of quality on its own.

I am surprised that density has the second strongest correlation with quality. After some reflection, I think that density may represent the balance of other features (sugar content and so on affect this property), which could help in constructing a model. However, I am unconvinced that humans can consciously perceive such minor fluctuations in density.
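As a minimal sketch of how features can be ranked by the strength of their correlation with quality: the data frame below is synthetic (made-up values standing in for the real wine data), and the names `cors` and `ranked` are purely illustrative.

```r
# Synthetic stand-in for the wine data frame (values are made up, not the real data).
set.seed(1)
n <- 200
wine <- data.frame(
  alcohol   = runif(n, 8, 14),
  chlorides = runif(n, 0.01, 0.1)
)
wine$density <- 1.003 - 0.001 * wine$alcohol + rnorm(n, sd = 0.001)
wine$quality <- round(pmin(pmax(3 + 0.3 * (wine$alcohol - 8) + rnorm(n), 3), 9))

# Correlation of every feature with quality, ordered by absolute strength.
cors <- cor(wine)[, "quality"]
cors <- cors[names(cors) != "quality"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Ordering by absolute value matters here, because a strong negative correlation (such as the one between density and quality) is as informative as a positive one.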

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The features with significant correlation to quality (other than Density and Alcohol) are Chlorides (-0.21), Volatile Acidity (-0.20) and Total Sulfur Dioxide (-0.17).

Did you create any new variables from existing variables in the dataset?

I’ve created a boolean column for high and low quality wines, so that I could generalise about the data in different ways.
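A minimal sketch of such a column, assuming the same quality > 5 split that is used later for the boolean prediction. The column name `high.quality` and the toy data frame are hypothetical.

```r
# Toy data frame with made-up quality scores (not the real wine data).
wine <- data.frame(quality = c(3, 5, 6, 7, 9, 5, 6))

# TRUE for wines rated above 5, FALSE for wines rated 5 or below.
wine$high.quality <- wine$quality > 5

# Count of low (FALSE) and high (TRUE) quality wines.
print(table(wine$high.quality))
```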

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I found that wine with residual sugar >= 22 g/l produced much less predictable results. We do not have many data points above this value, and the wines vary dramatically in residual sugar beyond this point, so I have made an executive decision to filter them out. I looked into wine sweetness, and wine between 18 and 45 g/l is considered medium, so I am limiting my analysis to dry and medium-dry wine and using 18 g/l as the cutoff. I think if we were to seriously analyse sweeter wines, we would need more data.
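The filter itself can be sketched as below. The toy values are made up, `sugar.cutoff` and `wine.filtered` are hypothetical names, and treating the cutoff as a strict "less than 18 g/l" is an assumption (the summary above only shows that the largest remaining value is 17.95).

```r
# Toy data with made-up residual sugar values in g/l (not the real wine data).
wine <- data.frame(
  residual.sugar = c(1.2, 5.0, 17.95, 18.0, 22.0, 31.6),
  quality        = c(6, 5, 6, 5, 7, 4)
)

# Assumed cutoff: keep only wines strictly below 18 g/l.
sugar.cutoff <- 18
wine.filtered <- subset(wine, residual.sugar < sugar.cutoff)

# Number of rows dropped by the filter.
removed <- nrow(wine) - nrow(wine.filtered)
print(removed)
```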

The distribution of residual sugar on the histogram is interesting, because it is roughly positively skewed.

Sulphates seem to be bi-modal while alcohol is in a sense multi-modal. My guess here is that there are different classes / categories of wine, targeting certain areas, and that far from being a coincidence, these values might be indicative of modal wines in different genres (or with different preservative requirements). It would be interesting to have had more categorical data for the wines, to analyse subsets, as well as white wine in general.

Bivariate Plots Section

## 
## Call:
## lm(formula = wine$quality ~ wine$density + I(wine$density^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1926 -0.6053  0.0921  0.4150  3.4251 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          17829       1370   13.01   <2e-16 ***
## wine$density        -35754       2756  -12.97   <2e-16 ***
## I(wine$density^2)    17932       1386   12.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8299 on 4810 degrees of freedom
## Multiple R-squared:  0.128,  Adjusted R-squared:  0.1276 
## F-statistic: 352.9 on 2 and 4810 DF,  p-value: < 2.2e-16
## 
## Calls:
## m1: lm(formula = wine$quality ~ wine$density + I(wine$density^2))
## m2: lm(formula = wine$quality ~ wine$density + I(wine$density^2) + 
##     wine$alcohol + I(wine$alcohol^2) + I(wine$alcohol^3) + I(wine$alcohol^4))
## m3: lm(formula = wine$quality ~ wine$density + I(wine$density^2) + 
##     wine$alcohol + I(wine$alcohol^2) + I(wine$alcohol^3) + I(wine$alcohol^4) + 
##     wine$residual.sugar)
## m4: lm(formula = wine$quality ~ wine$density + I(wine$density^2) + 
##     wine$alcohol + I(wine$alcohol^2) + I(wine$alcohol^3) + I(wine$alcohol^4) + 
##     wine$residual.sugar + wine$volatile.acidity)
## 
## ===================================================================================
##                               m1             m2             m3            m4       
## -----------------------------------------------------------------------------------
##   (Intercept)             17828.653***    5674.429***   3697.429*      439.918     
##                           (1370.216)     (1657.444)    (1657.013)    (1604.950)    
##   wine$density           -35754.517***  -11204.143***  -7134.415*     -699.446     
##                           (2756.356)     (3336.875)    (3337.274)    (3231.744)    
##   I(wine$density^2)       17931.604***    5646.941***   3545.262*      316.367     
##                           (1386.178)     (1677.259)    (1678.273)    (1625.165)    
##   wine$alcohol                             -38.487*      -36.263*      -16.768     
##                                            (18.657)      (18.497)      (17.845)    
##   I(wine$alcohol^2)                          4.855         4.666         1.975     
##                                             (2.558)       (2.536)       (2.446)    
##   I(wine$alcohol^3)                         -0.265        -0.261        -0.100     
##                                             (0.155)       (0.154)       (0.148)    
##   I(wine$alcohol^4)                          0.005         0.005         0.002     
##                                             (0.003)       (0.003)       (0.003)    
##   wine$residual.sugar                                      0.052***      0.050***  
##                                                           (0.006)       (0.005)    
##   wine$volatile.acidity                                                 -2.181***  
##                                                                         (0.113)    
## -----------------------------------------------------------------------------------
##   R-squared                     0.1            0.2           0.2           0.3     
##   adj. R-squared                0.1            0.2           0.2           0.3     
##   sigma                         0.8            0.8           0.8           0.8     
##   F                           352.9          202.9         189.1         225.2     
##   p                             0.0            0.0           0.0           0.0     
##   Log-likelihood            -5930.3        -5716.6       -5674.3       -5493.5     
##   Deviance                   3312.6         3031.0        2978.2        2762.7     
##   AIC                       11868.6        11449.1       11366.5       11007.0     
##   BIC                       11894.5        11500.9       11424.8       11071.8     
##   N                          4813           4813          4813          4813       
## ===================================================================================
## [1] 0.522221

## [1] 3292

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Density appears to decrease with alcohol. Interestingly, there seems to be less variation in the higher alcohol range.

The median alcohol level increases almost linearly with quality from the medium quality wines upwards, but goes the other way below that. Perhaps stronger wines are harder to get right, while weaker wines are often quite average. The IQR certainly shows an interesting trend, although there are many outliers, and lots of overlap between the whiskers at each quality.

Many low quality wines have a high density, and most high quality wines have a lower density. The variance is high here, so it’s hard to draw strong conclusions.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a negative correlation between pH and fixed acidity (-0.43). The surprising part for me is how weak a correlation this is. I expected them to be different measures of effectively the same value.

The amount of chlorides seems to increase with residual sugar (particularly the minimum chlorides). I would hypothesise that perhaps the salt content is increased to balance sweetness. I did not expect much correlation here so this is interesting.

What was the strongest relationship you found?

Density and alcohol have a very strong negative relationship (a correlation of about -0.81), matched in strength only by residual sugar and density (about 0.82). I think this must be a yeast / sugar issue, where the stronger the alcohol, the more sugar was consumed by the yeast, and the thinner the wine would be (or, as vintners calculate when making wine, it would have a lower specific gravity). Starting gravity (i.e. initial sugar levels) and different strains of yeast would account for much of the variance, and environmental factors such as fermentation temperature would likely also have an impact, but even with these there is an obvious relationship.
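One way to check which pairs of features are most strongly related is to sort the off-diagonal entries of the correlation matrix. The sketch below uses a four-variable sub-matrix with values rounded from the correlation output printed at the top of this report; the names `cm` and `pairs` are illustrative only.

```r
# Sub-matrix copied (rounded to 4 places) from the correlation output printed earlier.
vars <- c("residual.sugar", "density", "alcohol", "quality")
cm <- matrix(c(
   1.0000,  0.8204, -0.4518, -0.0842,
   0.8204,  1.0000, -0.8078, -0.3125,
  -0.4518, -0.8078,  1.0000,  0.4336,
  -0.0842, -0.3125,  0.4336,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by blanking the diagonal and the lower triangle.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape to one row per pair and sort by absolute correlation.
pairs <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
pairs <- pairs[!is.na(pairs$Freq), ]
names(pairs) <- c("feature.a", "feature.b", "correlation")
pairs <- pairs[order(abs(pairs$correlation), decreasing = TRUE), ]
print(pairs, row.names = FALSE)
```

On these values, residual sugar and density come out marginally ahead of density and alcohol, which fits the yeast / sugar explanation: sugar raises density while alcohol lowers it.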

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Quality seems very widely distributed when using it to colour alcohol vs. density; in particular, medium quality seems to be extremely varied. There is still a definite region of stronger alcohol and lower density that contains the largest area of high ratings.

When we remake the same plot using the quality prediction model outputs, we can see what is effectively a caricature of the effect above: it is too simplistic to genuinely reflect reality, but you can see significant similarities.

Residual sugar and density also seem to produce a very clear distribution, which is again very closely matched, visually, by the model.

Were there any interesting or surprising interactions between features?

Too much total sulfur dioxide seems to reduce the chances of a wine being rated as high quality.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model to predict quality using density, alcohol, residual sugar and volatile acidity. Its predictions have a correlation of 0.522 with quality.
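A rough sketch of this model follows, using the same terms as the m4 formula in the output above. The data here are synthetic (made-up values, not the real wine data), and the sketch passes `data = wine` rather than writing `wine$` in the formula, which is a stylistic substitution rather than the original call.

```r
# Synthetic stand-in for the filtered wine data (values are made up).
set.seed(42)
n <- 500
wine <- data.frame(
  alcohol          = runif(n, 8, 14),
  residual.sugar   = runif(n, 0.6, 18),
  volatile.acidity = runif(n, 0.08, 0.6)
)
wine$density <- 1.0 - 0.0012 * (wine$alcohol - 8) +
  0.0004 * wine$residual.sugar + rnorm(n, sd = 0.0005)
score <- 5.9 + 0.3 * (wine$alcohol - 10.5) -
  2 * (wine$volatile.acidity - 0.28) + rnorm(n, sd = 0.8)
wine$quality <- pmin(pmax(round(score), 3), 9)

# Same terms as m4: quadratic in density, quartic in alcohol, plus two linear terms.
m4 <- lm(quality ~ density + I(density^2) +
           alcohol + I(alcohol^2) + I(alcohol^3) + I(alcohol^4) +
           residual.sugar + volatile.acidity,
         data = wine)

# Correlation between the model's fitted values and the observed quality.
model.cor <- cor(fitted(m4), wine$quality)
print(model.cor)
```

For a linear model with an intercept, this correlation squared equals R-squared, so the reported 0.522 corresponds to an R-squared of about 0.27, in line with the rounded 0.3 shown for m4.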

The model for density was a 2nd-degree polynomial, and for alcohol I used a 4th-degree polynomial (maybe overkill / overfitting here). While that enabled more subtle inflections in the resultant quality prediction (and a closer fit), it was still only a moderate increase in correlation over that of alcohol on its own (0.436). Ultimately this is a very weak model, and is not well suited to the task. I think it might be possible to improve it, and I should have separated the data into random training and test samples, so I could at least have checked for over-fitting.

Perhaps the subtlety of a neural network could produce a more nuanced model than a linear one. The subjective rating of wine quality is complex to model.

The main strength of this model is that when reduced to a boolean prediction (of quality > 5 or <= 5) it gets it right for 3292 of the 4813 wines in this dataset. So with ~68% accuracy this model could be used, for example, to help position / price wine in a shop on the likelihood it would get an above or below average rating.
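The hold-out check mentioned above could look something like the sketch below. It was not part of the original analysis: the 70/30 split, the seed, the synthetic data and all variable names are assumptions for illustration.

```r
# Synthetic stand-in for the filtered wine data (values are made up).
set.seed(42)
n <- 500
wine <- data.frame(
  alcohol          = runif(n, 8, 14),
  residual.sugar   = runif(n, 0.6, 18),
  volatile.acidity = runif(n, 0.08, 0.6)
)
wine$density <- 1.0 - 0.0012 * (wine$alcohol - 8) +
  0.0004 * wine$residual.sugar + rnorm(n, sd = 0.0005)
score <- 5.9 + 0.3 * (wine$alcohol - 10.5) -
  2 * (wine$volatile.acidity - 0.28) + rnorm(n, sd = 0.8)
wine$quality <- pmin(pmax(round(score), 3), 9)

# Randomly assign 70% of rows to training and the rest to testing.
set.seed(7)
train.idx <- sample(nrow(wine), size = floor(0.7 * nrow(wine)))
train <- wine[train.idx, ]
test  <- wine[-train.idx, ]

# Fit on the training rows only.
fit <- lm(quality ~ density + I(density^2) +
            alcohol + I(alcohol^2) + I(alcohol^3) + I(alcohol^4) +
            residual.sugar + volatile.acidity,
          data = train)

# Compare in-sample and out-of-sample correlation with quality.
cor.train <- cor(fitted(fit), train$quality)
cor.test  <- cor(predict(fit, newdata = test), test$quality)
print(c(train = cor.train, test = cor.test))
```

A test correlation well below the training one would hint that the higher-degree polynomial terms are fitting noise.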
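The boolean accuracy is simply the share of wines for which the prediction and the actual rating fall on the same side of 5. A minimal sketch with made-up scores (the vectors and variable names are hypothetical), followed by the arithmetic for the figure quoted above:

```r
# Made-up observed quality scores and model predictions (not the real data).
quality   <- c(5, 6, 7, 4, 6, 5, 8, 6)
predicted <- c(4.9, 5.9, 6.3, 4.8, 4.9, 5.1, 6.6, 5.6)

# Reduce both to "above 5" vs "5 or below" and compare.
actual.high    <- quality > 5
predicted.high <- predicted > 5
accuracy <- mean(actual.high == predicted.high)
print(accuracy)

# Share of the larger class, i.e. the accuracy of always guessing the majority.
baseline <- max(mean(actual.high), 1 - mean(actual.high))
print(baseline)

# The figure reported for the real model: 3292 correct out of 4813 wines.
reported <- 3292 / 4813
print(round(reported, 3))
```

Comparing the accuracy with the majority-class share is a quick sanity check on how much the model adds over always guessing the more common class.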


Final Plots and Summary

Plot One - Mean Alcohol VS. Residual Sugar (Grouped by Quality Range)

Description One

We can clearly see that when we show mean alcohol by approximate quality, over residual sugar, that at the low end of residual sugar, stronger wines are generally of a higher quality. As we go up the residual sugar scale, these averages converge. We can’t read too much into this except to say that as wine gets sweeter, strength becomes less of an indicator of quality. On both lines we see a negative trend for strength and sweetness, but this hides so much of the variance, I’m not sure that’s useful.
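The summary behind a plot like this can be sketched as a grouped mean. Everything here is an assumption for illustration: the data are synthetic, the quality ranges are taken to be the quality > 5 split, and the 2 g/l bin width and the names `quality.range`, `sugar.bin` and `mean.alcohol` are hypothetical.

```r
# Synthetic stand-in for the filtered wine data (values are made up).
set.seed(3)
n <- 300
wine <- data.frame(
  residual.sugar = runif(n, 0.6, 18),
  alcohol        = runif(n, 8, 14),
  quality        = sample(3:9, n, replace = TRUE)
)

# Assumed grouping: "high" for quality above 5, "low" otherwise.
wine$quality.range <- ifelse(wine$quality > 5, "high", "low")

# Bin residual sugar into 2 g/l wide intervals.
wine$sugar.bin <- cut(wine$residual.sugar, breaks = seq(0, 18, by = 2))

# Mean alcohol for each sugar bin within each quality range.
mean.alcohol <- aggregate(alcohol ~ sugar.bin + quality.range,
                          data = wine, FUN = mean)
print(head(mean.alcohol))
```

Each row of `mean.alcohol` would be one point on one of the two lines, which is why both the binning and the averaging smooth away much of the underlying variance.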

The key take-away is that if we are going to make generalisations about wine, we might want to categorise it further. It’s also fair to say that each axis here provides a degree of smoothing, and generalisation is the operative word here. There are many exceptions to this rule.

Plot Two - Alcohol VS. Density

Description Two

Strong, less dense wine is more likely to be of a higher quality. That was the main outcome of my research. The indicators aren’t very strong, though: there’s a lot of medium quality wine everywhere, which makes it very difficult to make assumptions.

Plot Three - Residual Sugar VS. Density - With Quality / Predicted Quality

Description Three

These plots amazed me a little at first. The combined strength of residual sugar and density as predictors for quality seems very strong. We can see a lot of similarity between both the real quality and the predicted quality. There is still a huge amount of noise, and lots of grey points, but we see a clearer picture here.


Reflection

I think I’ve been able to create an interesting, although flawed, model for predicting wine quality. I’ve been able to show that there are some real correlations in the data. I’m happy with those aspects of my analysis.

My biggest challenge was trying to find stronger predictors than any that actually existed in the data. Near 70% accuracy of “is it good?”, “is it bad?” was still a positive outcome, but I imagined that the detailed data on wine chemistry would be enough to overcome the subjectivity of the quality ratings. It seems there is still a little je ne sais quoi in wine quality that goes beyond our dataset.

I also cannot decide if filtering the data was a good step or a bad step. The combination of the data being scant in the high-sugar range and being more extreme there led me to believe it was not worth analysing at this stage. I’m confident it helped me to draw more useful conclusions, so I think it wasn’t simply a case of convenience.

My model was very moderate! I am sad about this. It was not able to predict quality below 4 or above 7. This is in some ways to be expected, but it is very much a symptom of the variability / non-linear nature of the real data.

I think I could have possibly handled the interval quality data better too. Continuous values are easier to model, and a linear model will naturally produce values in between the outputs.